Abstract

Building unified timelines from a collection of
written news articles requires cross-document
event coreference resolution and temporal relation
extraction. In this paper we present an approach
to event coreference resolution based on:
a) similar temporal information, and b) similar
semantic arguments. Temporal information is
detected using an automatic temporal information
system (TIPSem), while semantic information is
represented by means of LDA Topic Modeling. The
evaluation of our approach shows that it obtains
the highest Micro-average F-score results in
SemEval-2015 Task 4: “TimeLine: Cross-Document
Event Ordering” (25.36% for TrackB, 23.15% for
SubtrackB), with an improvement of up to 6% in
comparison to the other systems. However, our
experiment also revealed some drawbacks in the
Topic Modeling approach that degrade the
performance of the system.


1 Introduction

Since access to knowledge is crucial in any domain,
connecting and time-ordering the information
extracted from different documents is a very
important task. The goal of this paper is therefore
to build ordered timelines for a set of events
related to a target entity. In doing so, our approach
deals with two problems: a) cross-document event
coreference resolution and b) cross-document
temporal relation extraction.

In order to arrange event mentions in a timeline, it
is necessary to know which event mentions co-refer
to the same event or fact and occur at the same
moment. Our approach attempts to formalize the idea
that two or more event mentions co-refer if they have
not only temporal compatibility (the events occur at
the same time) but also semantic compatibility (the
event mentions refer to the same facts, location,
entities, etc.).

Of a set of event mentions in one or more texts,
our proposal groups together the event mentions that

(i) have the same or a similar temporal reference,

(ii) have the same or a similar event head word, and

(iii) have main arguments that refer to the same or
similar topics.

In order to evaluate the system, we participated in
SemEval-2015 Task 4 “TimeLine: Cross-Document
Event Ordering”.

In the following sections we present the theoretical
background to our approach (Section 2) and the main
technical aspects (Sections 3 and 4). We then present
the results obtained (Section 5) and some
conclusions.

2 Background 

Two or more event mentions co-refer when they refer
to the same real fact or event. Two event mentions
can denote the same fact even though they have a
different syntactic structure, different words, or
even a different meaning. Whatever the case may be,
both event mentions must be semantically related.

An event mention is formed of an event head
(usually a verb or a deverbal noun) that is related
to a semantic structure (linguistically represented
as an argument structure with an agent, patient,
theme, instrument, etc., that is, the semantic
roles) in which there are some event participants
(entities) and which is located in place and time
(Levin and Rappaport-Hovav, 2005; Hovav et al.,
2010). The meaning of an event mention is therefore
not only the meaning of the event head, but also the
compositional meaning of all the components and
their relations: head, participants, time, place,
etc.

In order to detect this semantic relation between
event mentions, previous papers have isolated the
main components of the event structure. For
instance, Cybulska and Vossen (2013) apply an event
model based on four components: location, time,
participant and action. With regard to temporal
information, they consider only the explicit
temporal expressions that appear in the text; no
temporal information is inferred by navigating
temporal links. Bejan and Harabagiu (2014) use a rich
set of linguistic features to model the event
structure, including lexical features such as head
word and lemmas, class features such as PoS or event
class, semantic features such as WordNet sense or
semantic role frames, etc. They use an unsupervised
approach based on a nonparametric Bayesian model.


3 Our Approach 

In our approach we represent each event mention
as a head word (the event tag in the TimeML
(Saurí et al., 2006) annotation scheme) related to a
temporal expression (implicit or explicit), a set of
entities (zero or more), and a set of topics that
represents what the event mention refers to. This
paper focuses on temporal information processing
and topic-based semantic representation.

3.1 Temporal Information Processing 

The TimeML (Saurí et al., 2006) annotation scheme
has now been adopted as a standard by a large
number of researchers in the field of temporal
information annotation. It represents not only
events and temporal expressions, but also links
(Pustejovsky et al., 2003).

The system takes as input a manual annotation of
event mentions and the document creation time (DCT)
of each text, and an automatic system is used to
annotate temporal expressions and temporal links so
that a complete timeline of the input texts can be
established. When plain text is considered, systems
such as TIPSem (Temporal Information Processing
using Semantics) (Llorens et al., 2013; Llorens et
al., 2012)¹ are able to automatically annotate all
the temporal expressions (TIMEX3), events (EVENT)
and links between them.

Once the temporal links have been established, all
the specific temporal information for each event is
inferred by means of temporal link navigation. This
information allows us to determine temporal
compatibility between all the events considered.
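
The link navigation described above can be illustrated with a minimal
Python sketch. It is not the TIPSem implementation: it assumes that some
events are already anchored to a date and that the only links followed
are simultaneity-like links between events. The function names, the
event identifiers and the date format are hypothetical.

```python
from collections import deque


def infer_event_dates(anchors, same_time_links):
    """Propagate known dates along simultaneity-like links (assumed link type).

    anchors: dict mapping an event id to a date string it is anchored to.
    same_time_links: iterable of (event_a, event_b) pairs that share a time.
    """
    # Build an undirected graph from the links.
    graph = {}
    for a, b in same_time_links:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)

    dates = dict(anchors)
    queue = deque(anchors)
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour not in dates:
                # An unanchored event inherits the date of its linked event.
                dates[neighbour] = dates[node]
                queue.append(neighbour)
    return dates


def temporally_compatible(dates, e1, e2):
    """Two events are compatible here only if both have the same inferred date."""
    return e1 in dates and e2 in dates and dates[e1] == dates[e2]


if __name__ == "__main__":
    dates = infer_event_dates({"e1": "2010-06-07"}, [("e1", "e2"), ("e2", "e3")])
    print(dates)
    print(temporally_compatible(dates, "e1", "e3"))
```

Ordering relations such as before or after are deliberately left out of
this sketch; handling them would require interval reasoning rather than
simple date propagation.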

3.2 Topic-based Semantic Representation 

The meaning of each event structure has been
represented by using Topic Modeling (Blei, 2012) on
a reference corpus. Topic modeling is a family of
algorithms that automatically discover topics from a
collection of documents. More specifically, we apply
Latent Dirichlet Allocation (LDA) (Blei et al.,
2003), which follows a bottom-up approach. Each word
is assigned to a topic according to the words that
co-occur with it in the context (document) and the
topics assigned to this word in other documents. In
formal terms, a topic is a distribution over a fixed
vocabulary. We have applied LDA to the WikiNews
corpus.² Each topic in this corpus is represented
using its twenty most prominent words.
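
As a small illustration of this representation, the following Python
sketch keeps only the most prominent words of a topic. It assumes a
topic is available as a word-to-weight dictionary; the function name and
the toy weights are hypothetical and do not come from the WikiNews
model.

```python
def top_words(topic, n=20):
    """Return the n highest-weighted words of a topic with their weights.

    topic: dict mapping each vocabulary word to its weight in the topic.
    Ties are broken alphabetically so that the result is deterministic.
    """
    ranked = sorted(topic.items(), key=lambda item: (-item[1], item[0]))
    return dict(ranked[:n])


if __name__ == "__main__":
    toy_topic = {"phone": 0.4, "user": 0.3, "app": 0.2, "battery": 0.1}
    print(top_words(toy_topic, n=2))
```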


4 Architecture of the System 

Our approach to building timelines from written news
in English performs event coreference resolution by
applying three clustering processes in sequential
order: temporal clustering, lemma clustering, and
topic clustering. It combines various resources:

• Named entity recognition, using the OpeNER web
services.³

• TimeML automatic annotation of texts, using the
TIPSem system (Llorens et al., 2010).

• The NLTK⁴ verb lemmatizer, based on WordNet
(Fellbaum, 1998).

• The SENNA (Collobert et al., 2011) Semantic
Role Labeling system.

• The LDA Topic Modeling algorithm, using
MALLET (McCallum, 2002).

¹ http://gplsi.dlsi.ua.es/demos/TIMEE/
² https://dumps.wikimedia.org/enwikinews/
³ http://www.opener-project.eu/webservices
⁴ http://www.nltk.org/

4.1 Target Entity Filtering 

Target entity filtering first requires named entity
recognition and coreference resolution. This is done
by integrating the external OpeNER web services into
our proposal. More specifically, the components
applied in our proposal are the NER component,⁵
which identifies the names of people, cities, and
museums, and classifies them into a semantic class
(PERSON, LOCATION, etc.), and the coreference
resolution component,⁶ whose objective is to
identify all those words that refer to the same
object or entity.

Only those events that are part of sentences
containing the target entity or a coreferent mention
of the target are selected for the final timeline.
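
The selection rule above can be sketched as follows. This is only an
illustration: it assumes that entity recognition and coreference
resolution have already produced the set of surface forms referring to
the target, and it matches single tokens only. The data layout and the
function name are hypothetical and are not the OpeNER output format.

```python
def filter_events_by_target(sentences, target_mentions):
    """Keep the events of sentences that mention the target entity.

    sentences: list of dicts with "tokens" (list of str) and "events"
        (list of event ids or head words found in that sentence).
    target_mentions: surface forms of the target and its coreferent
        mentions, assumed to come from an external coreference resolver.
    """
    targets = {mention.lower() for mention in target_mentions}
    selected = []
    for sentence in sentences:
        tokens = {token.lower() for token in sentence["tokens"]}
        if tokens & targets:  # the sentence contains at least one mention
            selected.extend(sentence["events"])
    return selected


if __name__ == "__main__":
    sentences = [
        {"tokens": ["Apple", "launched", "the", "iPhone"], "events": ["launched"]},
        {"tokens": ["Nokia", "responded"], "events": ["responded"]},
    ]
    print(filter_events_by_target(sentences, {"Apple"}))
```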

4.2 Temporal Clustering Approach 

Given plain text, we use the TIPSem system to
automatically annotate all the temporal expressions
(TIMEX3), events (EVENT) and links between them.
The TLINKs annotated in the text are used to extract
the temporal context of each event, which makes it
possible to infer both the time at which each event
occurs and the temporal ordering of the events in
the text. Moreover, if we are able to determine the
time of an event, we are able to determine temporal
compatibility between events, even when they are
contained in different documents, which makes
cross-document event coreference resolution
possible.

In this first step, all the events from the
different documents that occur on the same date are
therefore placed in the same cluster. The clusters
are arranged in ascending order of their assigned
date.
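
A minimal sketch of this first clustering step is given below. It
assumes that every event already carries an inferred date in ISO format
(YYYY-MM-DD), so that sorting the date strings yields chronological
order; the event identifiers and the function name are hypothetical.

```python
from collections import defaultdict


def temporal_clusters(events):
    """Group events by date and return the clusters in ascending date order.

    events: iterable of (event_id, iso_date) pairs, possibly coming from
        different documents.
    """
    clusters = defaultdict(list)
    for event_id, date in events:
        clusters[date].append(event_id)
    # ISO dates sort chronologically when compared as strings.
    return sorted(clusters.items())


if __name__ == "__main__":
    events = [
        ("doc1-e1", "2010-06-07"),
        ("doc2-e4", "2009-01-15"),
        ("doc2-e7", "2010-06-07"),
    ]
    for date, members in temporal_clusters(events):
        print(date, members)
```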

4.3 Semantic Clustering Based on Lemmas 

Once all the events that share temporal information
and the target entity have been grouped together, we
apply a simple clustering based on head word lemmas.
This lemma-based clustering groups together all
event mentions with the same head word lemma, the
same temporal information and the same target
entity. We therefore assume that all these event
mentions co-refer to the same event. This is our
Run 1 in the competition.

⁵ http://opener.olery.com/ner
⁶ http://opener.olery.com/coreference
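
The lemma-based grouping of Run 1 amounts to clustering on a composite
key. The Python sketch below illustrates the idea under the assumption
that each event mention is available as a dictionary with a target
entity, a date and a head word lemma; the field names are hypothetical.

```python
from collections import defaultdict


def lemma_clusters(mentions):
    """Group event mentions sharing target entity, date and head word lemma.

    mentions: iterable of dicts with the keys "id", "target", "date"
        and "lemma" (assumed layout, for illustration only).
    """
    clusters = defaultdict(list)
    for mention in mentions:
        key = (mention["target"], mention["date"], mention["lemma"])
        clusters[key].append(mention["id"])
    return dict(clusters)


if __name__ == "__main__":
    mentions = [
        {"id": "doc1-e1", "target": "Apple", "date": "2010-06-07", "lemma": "launch"},
        {"id": "doc2-e3", "target": "Apple", "date": "2010-06-07", "lemma": "launch"},
        {"id": "doc2-e5", "target": "Apple", "date": "2010-06-07", "lemma": "sell"},
    ]
    print(lemma_clusters(mentions))
```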


4.4 Semantic Clustering Based on Topics 


The problem with the lemma-based clustering is that
it does not take into account the argument structure
of the event. This last clustering therefore
attempts to solve this problem by extracting the
semantic roles from each event and representing
their meaning by using topics from a reference
corpus. This approach has three steps:

1. Using SENNA (Collobert et al., 2011) for
Semantic Role Labeling, we detect the roles A0 and
A1⁷ that are related to the event mention head
word. For each role we extract only the nouns.

2. We have extracted 500 topics from WikiNews
using Topic Modeling with MALLET. All these topics
are used as a knowledge base. We use only the most
representative words for each topic (the twenty
words with the greatest weight) and the weights
that they have in each topic.

3. Finally, we create an event-topic matrix. Each
event (row) is represented by a vector whose values
are the sums of the weights of its argument nouns
in each topic (column).

For example, if the nouns in arguments A0 and A1 are
“users, problems, phones”, we represent their
meanings according to the topics t_n assigned to
them by applying LDA to WikiNews (users = t_0, ...;
problems = t_0, t_2; phones = t_5, t_6; etc.). Then,
the event e of this sentence is represented by an
n-dimensional vector in which n is the number of
topics (500) and whose values are the sums of the
weights of each noun in each topic t_n.
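
The construction of one row of the event-topic matrix can be sketched as
follows. The example assumes that each topic is stored as a dictionary
of its most representative words and their weights, and it uses three
toy topics instead of 500; the weights, the lemmatized nouns and the
function name are hypothetical.

```python
def event_vector(nouns, topics):
    """Represent an event as one value per topic.

    nouns: nouns extracted from the A0 and A1 arguments of the event.
    topics: list of dicts mapping a representative word to its weight
        in that topic; words outside the dict contribute nothing.
    """
    return [sum(topic.get(noun, 0.0) for noun in nouns) for topic in topics]


if __name__ == "__main__":
    toy_topics = [
        {"user": 0.5, "problem": 0.25},  # toy topic 0
        {"problem": 0.5},                # toy topic 1
        {"phone": 0.75},                 # toy topic 2
    ]
    print(event_vector(["user", "problem", "phone"], toy_topics))
```

Stacking one such vector per event mention gives the matrix whose rows
are events and whose columns are topics.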

In order to group together similar event mentions,
we have applied a k-means clustering algorithm to
these event vectors.⁸ The distance metric used is
Euclidean distance. The number of clusters has been
set to two.⁹ Therefore, each cluster with the same
head word lemma, the same temporal information and
the same target entity is then re-clustered
according to the similarity of the main topics of
its arguments. This clustering corresponds to our
Run 2 in the competition.

⁷ In order to represent Semantic Roles, SENNA uses
the tag set proposed by the Proposition Bank Project
(http://verbs.colorado.edu/~mpalmer/projects/ace.html).
A0 and A1 represent the main roles related to each
verb.

⁸ Note that it has been applied only to the events
previously clustered following the lemma-based
approach (Run 1).
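
The re-clustering step can be illustrated with a plain-Python k-means
using Euclidean distance and two clusters. This is a sketch and not the
PyCluster implementation used in the system: the initialization (the
first k vectors) is a simplifying assumption that makes the example
deterministic, and the function name is hypothetical.

```python
import math


def kmeans(vectors, k=2, iterations=100):
    """Cluster vectors with Lloyd's algorithm and Euclidean distance.

    Returns a list with the cluster index assigned to each vector.
    Initial centroids are simply the first k vectors (assumption).
    """
    if len(vectors) < k:
        raise ValueError("need at least k vectors")
    centroids = [list(vector) for vector in vectors[:k]]
    assignment = None
    for _ in range(iterations):
        # Assignment step: each vector goes to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(vector, centroids[c]))
            for vector in vectors
        ]
        if new_assignment == assignment:
            break  # converged: no vector changed cluster
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


if __name__ == "__main__":
    event_vectors = [[0.0, 0.0], [5.0, 5.0], [0.1, 0.2], [5.1, 4.9]]
    print(kmeans(event_vectors, k=2))
```

Because k is fixed in advance, the procedure always returns two groups
when given at least two distinct vectors.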

5 Evaluation Results 

SemEval-2015 Task 4 consists of building timelines
from written news in English in which a target
entity is involved. The input data provided by the
organizers is therefore a set of documents and a set
of target entities related to those documents. Two
different tracks are proposed in the task, along
with their subtracks:

• Track A: This consists of using raw texts as
input and obtaining full timelines. Subtrack A has
the same input data, but the output is timelines of
ordered events only (no assignment of time
anchors).

• Track B: This consists of using texts with manual
annotation of event mentions as input data.
Subtrack B has the same input data, but the output
is timelines of ordered events only.

In the SemEval-2015 Task 4 competition we
participated in Track B and Subtrack B. The results
for the Micro-average F-score measure obtained by
our approach in the competition are shown in
Table 1.


TRACK           Corpus 1   Corpus 2   Corpus 3   Total
TrackB-R1       22.35      19.28      33.59      25.36
TrackB-R2       20.47      16.17      29.90      22.66
SubTrackB-R1    18.35      20.48      32.08      23.15
SubTrackB-R2    15.93      14.44      27.48      19.18

Table 1: Micro-average F-score (%) results for the
GPLSIUA approach.


Although the Micro-average F-score results are not
very high, the results obtained by our approach are
the highest on all of the corpora evaluated by the
organizers. Our approach obtained an improvement of
7% over the other participant in Track B and of
6.48% in Subtrack B.

⁹ We have used the PyCluster tool:
https://pypi.python.org/pypi/Pycluster


6 Conclusions 

The results show that our approach is suitable for
the task at hand. On the one hand, temporal
information is automatically extracted with a
temporal information processing system, which makes
it possible to infer and determine the time at which
each event occurred. On the other hand, the semantic
similarity based on the verb is sufficient to group
together coreferent events.

The basic method (Run 1), which consists of
searching for similar verb lemmas, eventually proved
to be the best. We have therefore carried out an
in-depth analysis of the results obtained for Run 2
and have observed three main drawbacks in the Topic
Modeling approach:

• The k-means algorithm forces us to fix the number
of clusters beforehand, and this has been fixed at
2. However, there is often only one correct
cluster. An approach without a fixed number of
topics should improve on this. Bejan and Harabagiu
(2014), for example, suggest inferring this value
from data.

• The representativity of each event mention
depends directly on the number of topics extracted
from the reference corpus. Many topics will produce
excessive granularity, and few topics will be
unrepresentative. We have set the number of topics
at 500, but it is necessary to study whether a
different number of topics would improve the
results.

• This approach depends excessively on the
representativity of the reference corpus. We
believe that using larger corpora should improve the
results.

As future work, we plan to use other similarity
measures and clustering algorithms in an attempt to
solve the problem of the previously fixed number of
clusters. We also plan to evaluate different Topic
Modeling configurations.